Mean Squared Logarithmic Error (MSLE)#

Mean Squared Logarithmic Error (MSLE) is a regression metric for non-negative targets. It is mean squared error computed after a log transform, \(\log(1+y)\), so it primarily measures relative (multiplicative) differences rather than absolute ones.


Learning goals#

  • Define MSLE precisely (domain, weights, multioutput)

  • Build intuition for what MSLE does (and does not) penalize

  • Implement MSLE from scratch in NumPy

  • Visualize MSLE behavior across scales with Plotly

  • Use MSLE as a training objective via log-space linear regression

Prerequisites#

  • NumPy basics (vectorization, shapes)

  • Comfort with derivatives for least squares (or gradients)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import mean_squared_error, mean_squared_log_error

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition#

Let \(y \in \mathbb{R}^n\) be targets and \(\hat{y} \in \mathbb{R}^n\) predictions. MSLE is defined as:

\[ \mathrm{MSLE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n\Big(\log(1+y_i) - \log(1+\hat{y}_i)\Big)^2 \]

Equivalently, with the shorthand \(\log1p(t)=\log(1+t)\):

\[ \mathrm{MSLE}(y, \hat{y}) = \mathrm{MSE}(\log1p(y),\; \log1p(\hat{y})) \]

Domain constraints#

Because we evaluate \(\log(1+y)\), we must have:

\[ y_i \ge 0 \quad \text{and} \quad \hat{y}_i \ge 0 \]

(scikit-learn enforces non-negativity).
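A quick check of the domain constraint (a minimal sketch; the exact error message varies across scikit-learn versions, but negative values in either array raise):

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# A negative value is outside the MSLE domain, so sklearn raises ValueError
try:
    mean_squared_log_error(np.array([1.0, -0.5]), np.array([1.0, 0.5]))
    raised = False
except ValueError:
    raised = True

raised
```

The same check applies to predictions, not just targets.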

Weighted MSLE#

With non-negative sample weights \(w_i\):

\[ \mathrm{MSLE}_w(y,\hat{y}) = \frac{\sum_{i=1}^n w_i\,\big(\log1p(y_i)-\log1p(\hat{y}_i)\big)^2}{\sum_{i=1}^n w_i} \]
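The weighted formula can be checked directly against the sample_weight argument of scikit-learn's implementation (values here are arbitrary):

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_t = np.array([1.0, 4.0, 9.0])
y_p = np.array([1.5, 3.0, 10.0])
w_demo = np.array([0.5, 2.0, 1.0])

# Weighted MSLE computed straight from the formula above
sq_log_err = (np.log1p(y_t) - np.log1p(y_p)) ** 2
msle_manual = float(np.sum(w_demo * sq_log_err) / np.sum(w_demo))

msle_sklearn = mean_squared_log_error(y_t, y_p, sample_weight=w_demo)

msle_manual, msle_sklearn
```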

Multioutput#

For \(y,\hat{y}\in\mathbb{R}^{n\times d}\) you compute MSLE per output and then aggregate (uniform average or a weighted average).

Units / interpretation#

MSLE is dimensionless (it operates on log values). A common variant is:

\[ \mathrm{RMSLE} = \sqrt{\mathrm{MSLE}} \]

Because \(\log1p(\hat{y})-\log1p(y)=\log\frac{1+\hat{y}}{1+y}\), RMSLE is the root-mean-square of a log-ratio error.

# A tiny example
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

msle = mean_squared_log_error(y_true, y_pred)
rmsle = float(np.sqrt(msle))

msle, rmsle
(0.03973012298459379, 0.19932416558108)
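The equivalence with MSE on log1p-transformed values is easy to confirm on the same arrays:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

y_t = np.array([3.0, 5.0, 2.5, 7.0])
y_p = np.array([2.5, 5.0, 4.0, 8.0])

# MSLE(y, y_hat) equals MSE(log1p(y), log1p(y_hat))
msle_direct = mean_squared_log_error(y_t, y_p)
msle_via_mse = mean_squared_error(np.log1p(y_t), np.log1p(y_p))

msle_direct, msle_via_mse
```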

2) Intuition: squared error in log-space#

MSLE is just MSE applied to the transformed values \(z=\log(1+y)\). Two key consequences:

  1. Large values are compressed. Differences among big targets contribute less than they would under MSE on the original scale.

  2. MSLE mostly measures relative error. Using a first-order Taylor approximation:

\[ \log(1+y+\varepsilon) = \log(1+y) + \frac{\varepsilon}{1+y} + O(\varepsilon^2) \]

so for small errors \(\varepsilon = \hat{y}-y\):

\[ \log1p(\hat{y}) - \log1p(y) \approx \frac{\hat{y}-y}{1+y} \]

and therefore:

\[ \mathrm{MSLE}(y,\hat{y}) \approx \frac{1}{n}\sum_{i=1}^n\left(\frac{\hat{y}_i-y_i}{1+y_i}\right)^2 \]

which is a smoothed squared relative error (for large \(y\), \(1+y\approx y\); near zero, \(1+y\approx 1\) so MSLE behaves closer to MSE).
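A small simulation (the ~1% noise scale is an arbitrary choice for illustration) confirms that the approximation is tight when relative errors are small:

```python
import numpy as np

rng_demo = np.random.default_rng(0)

y_s = rng_demo.uniform(1.0, 100.0, size=1000)
# ~1% multiplicative errors, so the first-order Taylor expansion applies
eps = rng_demo.normal(0.0, 0.01, size=1000) * (1.0 + y_s)
y_s_hat = y_s + eps

msle_exact = float(np.mean((np.log1p(y_s_hat) - np.log1p(y_s)) ** 2))
msle_approx = float(np.mean(((y_s_hat - y_s) / (1.0 + y_s)) ** 2))

msle_exact, msle_approx
```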

# Visualize the log1p transform and its inverse
y_grid = np.linspace(0, 2000, 600)
z_grid = np.linspace(0, np.log1p(y_grid.max()), 600)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Transform: z = log(1+y)", "Inverse: y = exp(z) - 1"),
)

fig.add_trace(go.Scatter(x=y_grid, y=np.log1p(y_grid), mode="lines"), row=1, col=1)
fig.add_trace(go.Scatter(x=z_grid, y=np.expm1(z_grid), mode="lines"), row=1, col=2)

fig.update_xaxes(title_text="y", row=1, col=1)
fig.update_yaxes(title_text="log(1+y)", row=1, col=1)
fig.update_xaxes(title_text="z", row=1, col=2)
fig.update_yaxes(title_text="exp(z)-1", row=1, col=2)

fig.update_layout(title="log1p compresses large values (expm1 reverses it)")
fig.show()

3) Multiplicative symmetry (ratios, not differences)#

A single-sample log error can be rewritten as a ratio:

\[ \log1p(\hat{y}) - \log1p(y) = \log\left(\frac{1+\hat{y}}{1+y}\right) \]

So the per-sample squared log error is:

\[ \big(\log\tfrac{1+\hat{y}}{1+y}\big)^2 \]

This means MSLE treats over- and under-prediction symmetrically in multiplicative terms: predicting a factor \(k\) too large (in \(1+y\)) has the same loss as predicting a factor \(k\) too small.
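This symmetry can be verified with single samples: a factor-3 over-prediction and a factor-3 under-prediction in \((1+y)\) terms both incur a loss of \((\log 3)^2\):

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_ref = np.array([5.0])    # 1 + y_true = 6
y_over = np.array([17.0])  # (1 + y_pred) / (1 + y_true) = 18 / 6 = 3
y_under = np.array([1.0])  # (1 + y_pred) / (1 + y_true) = 2 / 6 = 1/3

loss_over = mean_squared_log_error(y_ref, y_over)
loss_under = mean_squared_log_error(y_ref, y_under)

loss_over, loss_under, float(np.log(3.0) ** 2)
```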

# Per-sample MSLE as a function of the ratio r = (1+y_pred)/(1+y_true)
r = np.geomspace(0.05, 20.0, 600)
sq_log_error = np.log(r) ** 2

fig = px.line(
    x=r,
    y=sq_log_error,
    log_x=True,
    title="Per-sample squared log error depends on the ratio (1+y_pred)/(1+y_true)",
    labels={"x": "ratio r = (1+y_pred)/(1+y_true)", "y": "(log r)^2"},
)
fig.add_vline(x=1.0, line_dash="dash", line_color="black")

# Show the symmetry: r and 1/r yield the same loss
for r0 in [2.0, 3.0, 5.0]:
    fig.add_trace(
        go.Scatter(
            x=[r0, 1.0 / r0],
            y=[np.log(r0) ** 2, np.log(1.0 / r0) ** 2],
            mode="markers",
            marker=dict(size=10),
            name=f"r={r0} and 1/r",
        )
    )

fig.show()

4) MSLE vs MSE across scales#

Suppose the model makes a constant relative error, e.g. it always over-predicts by 20%.

  • Under MSE, the per-sample contribution \((\hat{y}-y)^2\) grows like \(y^2\). Large targets dominate the metric.

  • Under MSLE, the per-sample contribution is approximately a constant (for large \(y\)) because it is driven by the ratio \((1+\hat{y})/(1+y)\).

The plot below shows this effect.

y_true = np.geomspace(0.1, 2000.0, 200)
ratio = 1.20
y_pred = ratio * y_true

sq_err_mse = (y_pred - y_true) ** 2
sq_err_msle = (np.log1p(y_pred) - np.log1p(y_true)) ** 2

target_log_ratio_sq = float(np.log(ratio) ** 2)

fig = make_subplots(rows=1, cols=2, subplot_titles=("MSE contribution", "MSLE contribution"))

fig.add_trace(
    go.Scatter(x=y_true, y=sq_err_mse, mode="markers", name="(y_pred - y_true)^2"),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=y_true, y=sq_err_msle, mode="markers", name="(log1p diff)^2"),
    row=1,
    col=2,
)

fig.add_hline(y=target_log_ratio_sq, line_dash="dash", line_color="black", row=1, col=2)

fig.update_xaxes(title_text="y_true", type="log", row=1, col=1)
fig.update_yaxes(title_text="squared error", type="log", row=1, col=1)

fig.update_xaxes(title_text="y_true", type="log", row=1, col=2)
fig.update_yaxes(title_text="squared log error", row=1, col=2)

fig.update_layout(
    title="Constant 20% relative error: MSE explodes with scale, MSLE stays (nearly) constant",
    showlegend=False,
)
fig.show()

5) NumPy implementation from scratch#

Below is a small NumPy implementation that mirrors the core ideas in sklearn.metrics.mean_squared_log_error:

  • supports 1D targets (n,) and multioutput targets (n, d)

  • optional sample_weight (weighted average over samples)

  • multioutput: 'raw_values', 'uniform_average', or output weights (d,)

It also enforces the MSLE domain constraint: no negative values.

def mean_squared_log_error_np(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    multioutput="uniform_average",
):
    """Compute MSLE with optional sample weights and multioutput support.

    Parameters
    ----------
    y_true, y_pred:
        Array-like with shape (n,) or (n, d). Must be non-negative.
    sample_weight:
        None or array-like with shape (n,)
    multioutput:
        - 'raw_values'        -> return array of shape (d,)
        - 'uniform_average'   -> return scalar (mean over outputs)
        - array-like (d,)     -> weighted average over outputs

    Returns
    -------
    float or np.ndarray
    """

    y_true_arr = np.asarray(y_true, dtype=float)
    y_pred_arr = np.asarray(y_pred, dtype=float)

    if y_true_arr.shape != y_pred_arr.shape:
        raise ValueError(f"Shape mismatch: y_true {y_true_arr.shape} vs y_pred {y_pred_arr.shape}")

    if np.any(y_true_arr < 0) or np.any(y_pred_arr < 0):
        raise ValueError("MSLE is only defined for non-negative y_true and y_pred")

    if y_true_arr.ndim == 1:
        y_true_arr = y_true_arr[:, None]
        y_pred_arr = y_pred_arr[:, None]
    elif y_true_arr.ndim != 2:
        raise ValueError("y_true/y_pred must be 1D or 2D")

    n_samples, n_outputs = y_true_arr.shape

    log_true = np.log1p(y_true_arr)
    log_pred = np.log1p(y_pred_arr)

    sq_log_error = (log_true - log_pred) ** 2  # (n, d)

    if sample_weight is not None:
        w = np.asarray(sample_weight, dtype=float)
        if w.ndim != 1 or w.shape[0] != n_samples:
            raise ValueError(f"sample_weight must have shape ({n_samples},), got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")

        denom = float(np.sum(w))
        if denom == 0.0:
            raise ValueError("sum(sample_weight) must be > 0")

        msle_per_output = np.sum(sq_log_error * w[:, None], axis=0) / denom
    else:
        msle_per_output = np.mean(sq_log_error, axis=0)

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return msle_per_output
        if multioutput == "uniform_average":
            return float(np.mean(msle_per_output))
        raise ValueError(
            "multioutput must be 'raw_values', 'uniform_average', or array-like of output weights"
        )

    output_weights = np.asarray(multioutput, dtype=float)
    if output_weights.shape != (n_outputs,):
        raise ValueError(f"multioutput weights must have shape ({n_outputs},), got {output_weights.shape}")

    weight_sum = float(np.sum(output_weights))
    if weight_sum == 0.0:
        raise ValueError("sum(multioutput weights) must be > 0")

    return float(np.dot(msle_per_output, output_weights) / weight_sum)


# Quick check vs scikit-learn
y_true_2d = rng.uniform(0.0, 8.0, size=(30, 3))
y_pred_2d = np.clip(y_true_2d + rng.normal(scale=0.8, size=(30, 3)), 0.0, None)
w = rng.uniform(0.5, 2.0, size=30)

ours = mean_squared_log_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")
sk = mean_squared_log_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")

ours, sk, np.allclose(ours, sk)
(array([0.0611, 0.0547, 0.058 ]), array([0.0611, 0.0547, 0.058 ]), True)

6) Gradient w.r.t. predictions (useful for optimization)#

For one sample, define the log residual:

\[ e_i = \log1p(\hat{y}_i) - \log1p(y_i) \]

MSLE is \(J = \frac{1}{n}\sum_i e_i^2\). Using the chain rule:

\[ \frac{\partial J}{\partial \hat{y}_i} = \frac{2}{n}\,e_i\,\frac{\partial}{\partial \hat{y}_i}\log1p(\hat{y}_i) = \frac{2}{n}\,\frac{\log1p(\hat{y}_i) - \log1p(y_i)}{1+\hat{y}_i} \]

Notice the factor \(\frac{1}{1+\hat{y}_i}\): gradients are naturally scaled like relative errors.

def msle_grad_y_pred(y_true, y_pred):
    """Gradient of MSLE w.r.t y_pred (1D arrays, uniform averaging)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    if y_true.shape != y_pred.shape:
        raise ValueError("shape mismatch")
    if np.any(y_true < 0) or np.any(y_pred < 0):
        raise ValueError("MSLE is only defined for non-negative values")

    e = np.log1p(y_pred) - np.log1p(y_true)
    return (2.0 / y_true.size) * e / (1.0 + y_pred)


# Finite-difference check
y_true = rng.uniform(0.0, 5.0, size=7)
y_pred = rng.uniform(0.0, 5.0, size=7)

g_ana = msle_grad_y_pred(y_true, y_pred)

eps = 1e-6
g_num = np.zeros_like(y_pred)
for i in range(y_pred.size):
    y_pred_p = y_pred.copy()
    y_pred_m = y_pred.copy()
    y_pred_p[i] += eps
    y_pred_m[i] -= eps
    y_pred_m[i] = max(0.0, y_pred_m[i])  # stay in the MSLE domain (y_pred >= 0)

    f_p = mean_squared_log_error_np(y_true, y_pred_p)
    f_m = mean_squared_log_error_np(y_true, y_pred_m)
    g_num[i] = (f_p - f_m) / (2 * eps)

g_ana, g_num, np.allclose(g_ana, g_num, rtol=2e-4, atol=2e-4)
(array([ 0.0186,  0.0384, -0.0793, -0.0236, -0.0056,  0.0618,  0.0478]),
 array([ 0.0186,  0.0384, -0.0793, -0.0236, -0.0056,  0.0618,  0.0478]),
 True)

7) Using MSLE to optimize a model: log-space linear regression#

A very common way to train a model for MSLE is to predict in log-space. Define:

\[ z_i = \log1p(y_i) \]

and fit a linear model in \(z\):

\[ \hat{z}_i = b_0 + b_1 x_i \]

Then convert back to the original scale with:

\[ \hat{y}_i = \exp(\hat{z}_i) - 1 \]

This guarantees \(\hat{y}_i \ge 0\), and the objective becomes:

\[ J(b_0, b_1) = \frac{1}{n}\sum_{i=1}^n (z_i - (b_0 + b_1 x_i))^2 \]

which is just least squares on \(z\) (a convex quadratic).

Gradients#

Let \(r_i = \hat{z}_i - z_i\). Then:

\[ \frac{\partial J}{\partial b_0} = \frac{2}{n}\sum_{i=1}^n r_i \qquad \frac{\partial J}{\partial b_1} = \frac{2}{n}\sum_{i=1}^n x_i r_i \]

We can minimize this with gradient descent (or solve it in closed form).

# Synthetic data with multiplicative / log-normal noise
n = 260
x = rng.uniform(0.0, 6.0, size=n)

b0_true = 0.8
b1_true = 0.9
sigma = 0.35

z_true = b0_true + b1_true * x
z_obs = z_true + rng.normal(0.0, sigma, size=n)
y = np.expm1(z_obs)  # y >= 0

# Visualize in original space vs log space
fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Data in original y scale", "Same data in z = log(1+y) scale"),
)
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="y"), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=np.log1p(y), mode="markers", name="log1p(y)"), row=1, col=2)

fig.update_xaxes(title_text="x", row=1, col=1)
fig.update_yaxes(title_text="y", row=1, col=1)

fig.update_xaxes(title_text="x", row=1, col=2)
fig.update_yaxes(title_text="log(1+y)", row=1, col=2)

fig.update_layout(title="Log space often reveals a simpler (linear) relationship")
fig.show()
def fit_line_msle_gd(x, y, *, lr=0.05, n_steps=400):
    """Fit y_hat = exp(b0 + b1*x) - 1 by minimizing MSLE via GD in log-space."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.any(y < 0):
        raise ValueError("y must be non-negative for MSLE")

    z = np.log1p(y)
    n = x.size

    b0 = float(z.mean())
    b1 = 0.0

    history = {"step": [], "msle": [], "b0": [], "b1": []}

    for t in range(n_steps):
        z_hat = b0 + b1 * x
        r = z_hat - z
        msle_t = float(np.mean(r**2))

        grad_b0 = (2.0 / n) * np.sum(r)
        grad_b1 = (2.0 / n) * np.sum(x * r)

        b0 -= lr * grad_b0
        b1 -= lr * grad_b1

        history["step"].append(t)
        history["msle"].append(msle_t)
        history["b0"].append(b0)
        history["b1"].append(b1)

    return (b0, b1), history


(b0_gd, b1_gd), hist = fit_line_msle_gd(x, y, lr=0.06, n_steps=350)

# Closed-form least squares in z-space
X = np.column_stack([np.ones_like(x), x])
z = np.log1p(y)
b0_ols, b1_ols = np.linalg.lstsq(X, z, rcond=None)[0]

(b0_gd, b1_gd), (b0_ols, b1_ols)
((0.7382724858458961, 0.9098627244134785),
 (0.7381944920965867, 0.9098824642871552))
# Training curve
fig = px.line(
    x=hist["step"],
    y=hist["msle"],
    title="Gradient descent minimizing MSLE (in log space)",
    labels={"x": "step", "y": "MSLE"},
)
fig.show()
# Compare MSLE-trained model to an MSE-on-y linear baseline
b0_mse, b1_mse = np.linalg.lstsq(X, y, rcond=None)[0]

x_line = np.linspace(x.min(), x.max(), 300)
X_line = np.column_stack([np.ones_like(x_line), x_line])

y_pred_msle = np.expm1(b0_ols + b1_ols * x)
y_line_msle = np.expm1(X_line @ np.array([b0_ols, b1_ols]))

y_pred_mse = b0_mse + b1_mse * x
y_line_mse = X_line @ np.array([b0_mse, b1_mse])
y_pred_mse_clip = np.clip(y_pred_mse, 0.0, None)

metrics = {
    "MSE (baseline linear)": mean_squared_error(y, y_pred_mse),
    "MSE (MSLE-fit model)": mean_squared_error(y, y_pred_msle),
    "MSLE (baseline linear, clipped)": mean_squared_log_error(y, y_pred_mse_clip),
    "MSLE (MSLE-fit model)": mean_squared_log_error(y, y_pred_msle),
}
metrics
{'MSE (baseline linear)': 8020.024141455282,
 'MSE (MSLE-fit model)': 3214.396742264166,
 'MSLE (baseline linear, clipped)': 1.0271819896986292,
 'MSLE (MSLE-fit model)': 0.128488167880701}
# Visual comparison
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="data", opacity=0.65))
fig.add_trace(go.Scatter(x=x_line, y=y_line_msle, mode="lines", name="MSLE-fit: exp(linear)-1"))
fig.add_trace(go.Scatter(x=x_line, y=y_line_mse, mode="lines", name="MSE baseline: linear"))

fig.update_layout(
    title="MSLE-fit model tracks multiplicative structure; MSE-linear baseline can underfit small values",
    xaxis_title="x",
    yaxis_title="y",
)
fig.show()
# Loss landscape in (b0, b1) for the log-space objective (convex bowl)
b0_grid = np.linspace(b0_ols - 0.8, b0_ols + 0.8, 140)
b1_grid = np.linspace(b1_ols - 0.8, b1_ols + 0.8, 140)

B0, B1 = np.meshgrid(b0_grid, b1_grid)
residuals = z[None, None, :] - (B0[:, :, None] + B1[:, :, None] * x[None, None, :])
Z = np.mean(residuals**2, axis=2)

stride = max(1, len(hist["step"]) // 120)
b0_path = np.array(hist["b0"])[::stride]
b1_path = np.array(hist["b1"])[::stride]

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=b0_grid,
        y=b1_grid,
        z=Z,
        contours_coloring="heatmap",
        colorbar=dict(title="MSLE"),
    )
)
fig.add_trace(
    go.Scatter(
        x=b0_path,
        y=b1_path,
        mode="lines+markers",
        name="GD path",
        marker=dict(size=4, color="black"),
        line=dict(color="black"),
    )
)
fig.add_trace(
    go.Scatter(
        x=[b0_ols],
        y=[b1_ols],
        mode="markers",
        name="OLS optimum",
        marker=dict(symbol="x", size=10, color="white", line=dict(color="black", width=2)),
    )
)

fig.update_layout(
    title="MSLE landscape for log-space line: convex bowl + gradient descent trajectory",
    xaxis_title="b0",
    yaxis_title="b1",
)
fig.show()

8) Practical usage (scikit-learn)#

Scoring#

MSLE is available as:

from sklearn.metrics import mean_squared_log_error

In model selection, scikit-learn uses the (negated) scorer string:

neg_mean_squared_log_error
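A sketch of using that scorer in cross-validation; the synthetic data and the log-space model are illustrative choices, mirroring the pattern used earlier in this notebook:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng_demo = np.random.default_rng(0)
X_demo = rng_demo.uniform(0.0, 6.0, size=(200, 1))
y_demo = np.expm1(0.8 + 0.9 * X_demo[:, 0] + rng_demo.normal(0.0, 0.3, size=200))

model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)

# Scores are negated MSLE, so higher (closer to zero) is better
scores = cross_val_score(
    model, X_demo, y_demo, scoring="neg_mean_squared_log_error", cv=5
)
scores
```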

Training for MSLE#

A common pattern is to train a regressor on log1p(y) and invert with expm1. In scikit-learn, TransformedTargetRegressor provides this neatly.

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

X_1d = x[:, None]

reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
reg.fit(X_1d, y)
y_pred = reg.predict(X_1d)

mean_squared_log_error(y, y_pred)
0.12848816788070097

Pros, cons, and when to use MSLE#

Pros#

  • Relative-error focus: penalizes multiplicative differences (good for targets spanning orders of magnitude)

  • Less dominated by large targets than MSE on the original scale (log compresses large values)

  • Handles zeros (thanks to log(1+y))

  • Symmetric for over/under prediction in ratio terms: \(r\) and \(1/r\) have the same per-sample loss

Cons#

  • Requires non-negative values: cannot be used if \(y\) (or predictions) can be negative

  • The +1 matters: near zero, MSLE behaves more like MSE, and the metric is not invariant to rescaling \(y\) (changing the units of \(y\) changes how small values interact with the +1)

  • Can hide large absolute errors on large targets (by design)

  • Less directly interpretable than MAE/RMSE in the original units

Good use cases#

  • Demand/count forecasting, sales, traffic, inventory: targets are non-negative and often long-tailed

  • Problems where a 20% miss on a small value should matter similarly to a 20% miss on a large value

  • Situations where you expect multiplicative noise (log-normal style)


Common pitfalls / diagnostics#

  • If your model can output negatives (e.g., plain linear regression), mean_squared_log_error will raise a ValueError; consider:

    • predicting in log-space and using expm1, or

    • using a non-negative link function (e.g., softplus) or clipping (clipping changes the objective).

  • Plot residuals both in the original scale and in log-space; MSLE can look good while large absolute errors remain.
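Both workarounds for negative outputs can be sketched in a few lines; the softplus link via np.logaddexp is one illustrative choice among several:

```python
import numpy as np

raw = np.array([-2.0, -0.1, 0.5, 3.0])  # raw model outputs, possibly negative

# Option A: clip at zero -- simple, but flat (zero gradient) below 0
y_clipped = np.clip(raw, 0.0, None)

# Option B: softplus link, log(1 + exp(t)) -- smooth and strictly positive
y_softplus = np.logaddexp(0.0, raw)

y_clipped, y_softplus
```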

Exercises#

  1. Show that MSLE is exactly MSE on \(z=\log1p(y)\).

  2. Modify mean_squared_log_error_np to return RMSLE (square root) as an option.

  3. Create a dataset with additive noise instead of multiplicative noise and compare models trained with MSE vs MSLE.

References#

  • scikit-learn: sklearn.metrics.mean_squared_log_error

  • Log transform modeling pattern: sklearn.compose.TransformedTargetRegressor